Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.

The data was supplied by the Dutch data mining company Sentient Machine Research and is based on a real-world business problem. The training set contains over 5000 descriptions of customers, including the information of whether they have a caravan insurance policy. A test dataset contains another 4000 customers whose information will be used to test the effectiveness of the machine learning models.

The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?

ANALYSIS: The baseline performance of the seven algorithms achieved an average ROC score of 0.6965. Two algorithms, Decision Tree and Random Forest, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data, with an ROC score of 0.7159. Using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with an ROC score of 0.5285, which was significantly below the result from the training data.

CONCLUSION: For this iteration, the Random Forest algorithm achieved the leading ROC scores using the training and validation datasets. For this dataset, the Random Forest algorithm does not appear adequate for production use. Further modeling and testing are recommended as the next step.

Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)

One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project can generally be broken down into about six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(mailR)
library(parallel)
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Set up the email notification function

email_notify <- function(msg=""){
  sender <- "luozhi2488@gmail.com"
  receiver <- "dave@contactdavidlowe.com"
  sbj_line <- "Notification from R Script"
  password <- readLines("../email_credential.txt")
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"
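
Because the notifications are only a convenience, a failed SMTP call arguably should not stop a long modeling run. The following is a minimal sketch, assuming a hypothetical wrapper named safe_notify (not part of the original script or of mailR) that downgrades any error raised by a notifier function to a warning. The two stand-in notifiers are also made up for illustration; in the script, email_notify would take their place.

```r
# Hypothetical helper: run a notifier and turn any error into a warning
safe_notify <- function(notifier, msg = "") {
  tryCatch({
    notifier(msg)
    TRUE   # the notifier returned without error
  }, error = function(e) {
    warning("Notification failed: ", conditionMessage(e), call. = FALSE)
    FALSE  # report the failure without stopping the script
  })
}

# Stand-in notifiers for illustration only
ok_notifier <- function(msg) invisible(msg)
bad_notifier <- function(msg) stop("SMTP server unreachable")

sent_ok <- safe_notify(ok_notifier, "Library and Data Loading has begun!")
sent_bad <- suppressWarnings(
  safe_notify(bad_notifier, "Library and Data Loading has begun!")
)
```

With this wrapper, a call such as safe_notify(email_notify, "...") would let the rest of the script continue even if the mail server cannot be reached.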

1.c) Load dataset

# Read the list of attribute names from a file
attrFile = "TicAttributes.txt"
conn <- file(attrFile, open="r")
lines <- readLines(conn)
close(conn)
# word() is vectorized, so it extracts the first word of every line in one call
colNames <- word(lines)

# Import the records for the training dataset
inputFile = "ticdata2000.txt"
xy_train <- read.csv(inputFile, header = FALSE, sep = "\t", col.names = colNames)

# Standardize the class column to the name of targetVar
xy_train$targetVar <- "Yes"
xy_train$targetVar[xy_train$CARAVAN==0] <- "No"
xy_train$targetVar <- as.factor(xy_train$targetVar)
xy_train$targetVar <- relevel(xy_train$targetVar, "Yes")
xy_train$CARAVAN <- NULL
cat("Number of training rows and columns imported into xy_train:", nrow(xy_train), "by", ncol(xy_train), "\n")
## Number of training rows and columns imported into xy_train: 5822 by 86
# Import the records for the test/eval dataset without the target variable
noTargetCol <- colNames[-length(colNames)]
inputFile = "ticeval2000.txt"
x_test <- read.csv(inputFile, header = FALSE, sep = "\t", col.names = noTargetCol)
cat("Number of test rows and columns imported into x_test:", nrow(x_test), "by", ncol(x_test), "\n")
## Number of test rows and columns imported into x_test: 4000 by 85
# Import the records for the test/eval dataset with only the target variable
inputFile = "tictgts2000.txt"
y_test <- read.csv(inputFile, header = FALSE, col.names = c("CARAVAN"))
y_test$targetVar <- "Yes"
y_test$targetVar[y_test$CARAVAN==0] <- "No"
y_test$targetVar <- as.factor(y_test$targetVar)
y_test$targetVar <- relevel(y_test$targetVar, "Yes")
y_test$CARAVAN <- NULL
cat("Number of test rows and columns imported into y_test:", nrow(y_test), "by", ncol(y_test), "\n")
## Number of test rows and columns imported into y_test: 4000 by 1
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(xy_train)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, take care when slicing up the dataframes for visualization!
targetCol <- totCol
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
# training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
# xy_train <- originalDataset[training_index,]
# xy_test <- originalDataset[-training_index,]

if (targetCol==1) {
  x_train <- xy_train[,(targetCol+1):totCol]
  y_train <- xy_train[,targetCol]
  xy_test <- cbind(y_test, x_test)
  y_test <- xy_test[,targetCol]
} else {
  x_train <- xy_train[,1:(totAttr)]
  y_train <- xy_train[,totCol]
  xy_test <- cbind(x_test, y_test)
  y_test <- xy_test[,targetCol]
}
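
The target recoding used in this section (assign "Yes", overwrite with "No", convert to a factor, then relevel) can also be written as a single factor() call. The sketch below uses a made-up toy vector in place of the CARAVAN column. It assumes the column contains only 0 and 1; any other value would become NA here, whereas the step-by-step version above would label it "Yes".

```r
# Toy 0/1 target standing in for the CARAVAN column (illustrative values)
caravan <- c(0, 1, 0, 0, 1)

# levels= fixes the order (1 first), labels= renames the levels in that order
targetVar <- factor(caravan, levels = c(1, 0), labels = c("Yes", "No"))

levels(targetVar)        # "Yes" "No"
as.character(targetVar)  # "No" "Yes" "No" "No" "Yes"
```

Keeping "Yes" as the first level matters later because caret's twoClassSummary treats the first factor level as the event of interest.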

1.d) Set up the key parameters to be used in the script

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 5
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  5  by  17
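
The row count computed above is a ceiling division. As a small sketch, assuming a hypothetical helper named grid_rows (not in the original script), the same result can be obtained without the if/else branch:

```r
# Rows needed to fit n_attr panels into a grid with n_col columns
grid_rows <- function(n_attr, n_col) {
  ceiling(n_attr / n_col)
}

grid_rows(85, 5)  # 17, the same grid height reported above
grid_rows(86, 5)  # 18, one extra row for the leftover panel
```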

1.e) Set test options and evaluation metric

# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1, classProbs=TRUE, summaryFunction=twoClassSummary)
metricTarget <- "ROC"
email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"
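
The "ROC" metric selected above is the area under the ROC curve. To make that number concrete, here is a minimal base-R sketch that computes it from the rank-sum (Mann-Whitney) identity. The helper name auc_rank and the toy scores are assumptions made for illustration; this is not caret's own implementation.

```r
# Hypothetical helper: AUC via the rank-sum identity
auc_rank <- function(scores, labels, positive = "Yes") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- factor(c("Yes", "No", "Yes", "No", "No"), levels = c("Yes", "No"))

# Every "Yes" scores higher than every "No": perfect ranking
auc_perfect <- auc_rank(c(0.9, 0.2, 0.8, 0.3, 0.1), labels)  # 1

# Identical scores carry no ranking information
auc_flat <- auc_rank(rep(0.5, 5), labels)                    # 0.5
```

A score near 0.5, such as the validation result quoted in the ANALYSIS paragraph, therefore indicates a ranking that is close to uninformative.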

2. Summarize Data

To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@546a03af}"

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##   MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE
## 1      33        1       3        2        8      0      5      1      3
## 2      37        1       2        2        8      1      4      1      4
## 3      37        1       2        2        8      0      4      2      4
## 4       9        1       3        3        3      2      3      2      4
## 5      40        1       4        2       10      1      4      1      4
## 6      23        1       2        1        5      0      5      0      5
##   MRELGE MRELSA MRELOV MFALLEEN MFGEKIND MFWEKIND MOPLHOOG MOPLMIDD
## 1      7      0      2        1        2        6        1        2
## 2      6      2      2        0        4        5        0        5
## 3      3      2      4        4        4        2        0        5
## 4      5      2      2        2        3        4        3        4
## 5      7      1      2        2        4        4        5        4
## 6      0      6      3        3        5        2        0        5
##   MOPLLAAG MBERHOOG MBERZELF MBERBOER MBERMIDD MBERARBG MBERARBO MSKA
## 1        7        1        0        1        2        5        2    1
## 2        4        0        0        0        5        0        4    0
## 3        4        0        0        0        7        0        2    0
## 4        2        4        0        0        3        1        2    3
## 5        0        0        5        4        0        0        0    9
## 6        4        2        0        0        4        2        2    2
##   MSKB1 MSKB2 MSKC MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS MZPART
## 1     1     2    6    1      1      8     8     0     1       8      1
## 2     2     3    5    0      2      7     7     1     2       6      3
## 3     5     0    4    0      7      2     7     0     2       9      0
## 4     2     1    4    0      5      4     9     0     0       7      2
## 5     0     0    0    0      4      5     6     2     1       5      4
## 6     2     2    4    2      9      0     5     3     3       9      0
##   MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM MKOOPKLA PWAPART
## 1       0        4        5        0        0       4        3       0
## 2       2        0        5        2        0       5        4       2
## 3       4        5        0        0        0       3        4       2
## 4       1        5        3        0        0       4        4       0
## 5       0        0        9        0        0       6        3       0
## 6       5        2        3        0        0       3        3       0
##   PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO PVRAAUT PAANHANG PTRACTOR
## 1       0       0        6       0       0       0        0        0
## 2       0       0        0       0       0       0        0        0
## 3       0       0        6       0       0       0        0        0
## 4       0       0        6       0       0       0        0        0
## 5       0       0        0       0       0       0        0        0
## 6       0       0        6       0       0       0        0        0
##   PWERKT PBROM PLEVEN PPERSONG PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER
## 1      0     0      0        0       0       0      5       0        0
## 2      0     0      0        0       0       0      2       0        0
## 3      0     0      0        0       0       0      2       0        0
## 4      0     0      0        0       0       0      2       0        0
## 5      0     0      0        0       0       0      6       0        0
## 6      0     0      0        0       0       0      0       0        0
##   PFIETS PINBOED PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## 1      0       0        0       0       0       0        1       0       0
## 2      0       0        0       2       0       0        0       0       0
## 3      0       0        0       1       0       0        1       0       0
## 4      0       0        0       0       0       0        1       0       0
## 5      0       0        0       0       0       0        0       0       0
## 6      0       0        0       0       0       0        1       0       0
##   AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG AGEZONG AWAOREG
## 1       0        0        0      0     0      0        0       0       0
## 2       0        0        0      0     0      0        0       0       0
## 3       0        0        0      0     0      0        0       0       0
## 4       0        0        0      0     0      0        0       0       0
## 5       0        0        0      0     0      0        0       0       0
## 6       0        0        0      0     0      0        0       0       0
##   ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND targetVar
## 1      1       0        0      0       0        0        No
## 2      1       0        0      0       0        0        No
## 3      1       0        0      0       0        0        No
## 4      1       0        0      0       0        0        No
## 5      1       0        0      0       0        0        No
## 6      0       0        0      0       0        0        No

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 5822   86

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##   MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##    MGODOV    MGODGE    MRELGE    MRELSA    MRELOV  MFALLEEN  MFGEKIND 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MFWEKIND  MOPLHOOG  MOPLMIDD  MOPLLAAG  MBERHOOG  MBERZELF  MBERBOER 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MBERMIDD  MBERARBG  MBERARBO      MSKA     MSKB1     MSKB2      MSKC 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##      MSKD    MHHUUR    MHKOOP     MAUT1     MAUT2     MAUT0   MZFONDS 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##    MZPART   MINKM30  MINK3045  MINK4575  MINK7512  MINK123M   MINKGEM 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MKOOPKLA   PWAPART   PWABEDR   PWALAND  PPERSAUT   PBESAUT   PMOTSCO 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   PVRAAUT  PAANHANG  PTRACTOR    PWERKT     PBROM    PLEVEN  PPERSONG 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   PGEZONG   PWAOREG    PBRAND   PZEILPL  PPLEZIER    PFIETS   PINBOED 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  PBYSTAND   AWAPART   AWABEDR   AWALAND  APERSAUT   ABESAUT   AMOTSCO 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   AVRAAUT  AAANHANG  ATRACTOR    AWERKT     ABROM    ALEVEN  APERSONG 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   AGEZONG   AWAOREG    ABRAND   AZEILPL  APLEZIER    AFIETS   AINBOED 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  ABYSTAND targetVar 
## "integer"  "factor"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##     MOSTYPE         MAANTHUI         MGEMOMV         MGEMLEEF    
##  Min.   : 1.00   Min.   : 1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.: 1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :30.00   Median : 1.000   Median :3.000   Median :3.000  
##  Mean   :24.25   Mean   : 1.111   Mean   :2.679   Mean   :2.991  
##  3rd Qu.:35.00   3rd Qu.: 1.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :41.00   Max.   :10.000   Max.   :5.000   Max.   :6.000  
##     MOSHOOFD          MGODRK           MGODPR          MGODOV    
##  Min.   : 1.000   Min.   :0.0000   Min.   :0.000   Min.   :0.00  
##  1st Qu.: 3.000   1st Qu.:0.0000   1st Qu.:4.000   1st Qu.:0.00  
##  Median : 7.000   Median :0.0000   Median :5.000   Median :1.00  
##  Mean   : 5.774   Mean   :0.6965   Mean   :4.627   Mean   :1.07  
##  3rd Qu.: 8.000   3rd Qu.:1.0000   3rd Qu.:6.000   3rd Qu.:2.00  
##  Max.   :10.000   Max.   :9.0000   Max.   :9.000   Max.   :5.00  
##      MGODGE          MRELGE          MRELSA           MRELOV    
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:2.000   1st Qu.:5.000   1st Qu.:0.0000   1st Qu.:1.00  
##  Median :3.000   Median :6.000   Median :1.0000   Median :2.00  
##  Mean   :3.259   Mean   :6.183   Mean   :0.8835   Mean   :2.29  
##  3rd Qu.:4.000   3rd Qu.:7.000   3rd Qu.:1.0000   3rd Qu.:3.00  
##  Max.   :9.000   Max.   :9.000   Max.   :7.0000   Max.   :9.00  
##     MFALLEEN        MFGEKIND       MFWEKIND      MOPLHOOG    
##  Min.   :0.000   Min.   :0.00   Min.   :0.0   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:2.00   1st Qu.:3.0   1st Qu.:0.000  
##  Median :2.000   Median :3.00   Median :4.0   Median :1.000  
##  Mean   :1.888   Mean   :3.23   Mean   :4.3   Mean   :1.461  
##  3rd Qu.:3.000   3rd Qu.:4.00   3rd Qu.:6.0   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.00   Max.   :9.0   Max.   :9.000  
##     MOPLMIDD        MOPLLAAG        MBERHOOG        MBERZELF    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :3.000   Median :5.000   Median :2.000   Median :0.000  
##  Mean   :3.351   Mean   :4.572   Mean   :1.895   Mean   :0.398  
##  3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :5.000  
##     MBERBOER         MBERMIDD        MBERARBG       MBERARBO    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :0.0000   Median :3.000   Median :2.00   Median :2.000  
##  Mean   :0.5223   Mean   :2.899   Mean   :2.22   Mean   :2.306  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :9.00   Max.   :9.000  
##       MSKA           MSKB1           MSKB2            MSKC      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.000   Median :2.000   Median :2.000   Median :4.000  
##  Mean   :1.621   Mean   :1.607   Mean   :2.203   Mean   :3.759  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##       MSKD           MHHUUR          MHKOOP          MAUT1     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:5.00  
##  Median :1.000   Median :4.000   Median :5.000   Median :6.00  
##  Mean   :1.067   Mean   :4.237   Mean   :4.772   Mean   :6.04  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.00  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.00  
##      MAUT2           MAUT0          MZFONDS          MZPART     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:5.000   1st Qu.:1.000  
##  Median :1.000   Median :2.000   Median :7.000   Median :2.000  
##  Mean   :1.316   Mean   :1.959   Mean   :6.277   Mean   :2.729  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:8.000   3rd Qu.:4.000  
##  Max.   :7.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##     MINKM30         MINK3045        MINK4575        MINK7512     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :4.000   Median :3.000   Median :0.0000  
##  Mean   :2.574   Mean   :3.536   Mean   :2.731   Mean   :0.7961  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:1.0000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.0000  
##     MINK123M         MINKGEM         MKOOPKLA        PWAPART      
##  Min.   :0.0000   Min.   :0.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:0.0000  
##  Median :0.0000   Median :4.000   Median :4.000   Median :0.0000  
##  Mean   :0.2027   Mean   :3.784   Mean   :4.236   Mean   :0.7712  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:2.0000  
##  Max.   :9.0000   Max.   :9.000   Max.   :8.000   Max.   :3.0000  
##     PWABEDR           PWALAND           PPERSAUT       PBESAUT       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :5.00   Median :0.00000  
##  Mean   :0.04002   Mean   :0.07162   Mean   :2.97   Mean   :0.04827  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:6.00   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :4.00000   Max.   :8.00   Max.   :7.00000  
##     PMOTSCO          PVRAAUT            PAANHANG          PTRACTOR      
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.1754   Mean   :0.009447   Mean   :0.02096   Mean   :0.09258  
##  3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :7.0000   Max.   :9.000000   Max.   :5.00000   Max.   :6.00000  
##      PWERKT            PBROM           PLEVEN          PPERSONG      
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000   Median :0.0000   Median :0.00000  
##  Mean   :0.01305   Mean   :0.215   Mean   :0.1948   Mean   :0.01374  
##  3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :6.000   Max.   :9.0000   Max.   :6.00000  
##     PGEZONG           PWAOREG            PBRAND         PZEILPL         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000000  
##  Median :0.00000   Median :0.00000   Median :2.000   Median :0.0000000  
##  Mean   :0.01529   Mean   :0.02353   Mean   :1.828   Mean   :0.0008588  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.:0.0000000  
##  Max.   :3.00000   Max.   :7.00000   Max.   :8.000   Max.   :3.0000000  
##     PPLEZIER           PFIETS           PINBOED           PBYSTAND      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.01889   Mean   :0.02525   Mean   :0.01563   Mean   :0.04758  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :1.00000   Max.   :6.00000   Max.   :5.00000  
##     AWAPART         AWABEDR           AWALAND           APERSAUT     
##  Min.   :0.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :0.403   Mean   :0.01477   Mean   :0.02061   Mean   :0.5622  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :2.000   Max.   :5.00000   Max.   :1.00000   Max.   :7.0000  
##     ABESAUT           AMOTSCO           AVRAAUT            AAANHANG      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.01048   Mean   :0.04105   Mean   :0.002233   Mean   :0.01254  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :4.00000   Max.   :8.00000   Max.   :3.000000   Max.   :3.00000  
##     ATRACTOR           AWERKT             ABROM             ALEVEN       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.03367   Mean   :0.006183   Mean   :0.07042   Mean   :0.07661  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :4.00000   Max.   :6.000000   Max.   :2.00000   Max.   :8.00000  
##     APERSONG           AGEZONG            AWAOREG             ABRAND      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :1.0000  
##  Mean   :0.005325   Mean   :0.006527   Mean   :0.004638   Mean   :0.5701  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :2.000000   Max.   :7.0000  
##     AZEILPL             APLEZIER            AFIETS       
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.000000   Median :0.00000  
##  Mean   :0.0005153   Mean   :0.006012   Mean   :0.03178  
##  3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :2.000000   Max.   :3.00000  
##     AINBOED            ABYSTAND       targetVar 
##  Min.   :0.000000   Min.   :0.00000   Yes: 348  
##  1st Qu.:0.000000   1st Qu.:0.00000   No :5474  
##  Median :0.000000   Median :0.00000             
##  Mean   :0.007901   Mean   :0.01426             
##  3rd Qu.:0.000000   3rd Qu.:0.00000             
##  Max.   :2.000000   Max.   :2.00000

2.a.v) Summarize the levels of the class attribute.

#entireDataset_x <- entireDataset[,1:(totCol-1)]
#entireDataset_y <- entireDataset[,totCol]
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##     freq percentage
## Yes  348   5.977327
## No  5474  94.022673
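
The class counts above show a heavy imbalance. As a quick worked example using those same counts, a model that always predicts "No" would already reach about 94% accuracy, which is one reason a ranking metric such as ROC is more informative than accuracy for this dataset.

```r
# Class counts taken from the frequency table above
n_yes <- 348
n_no <- 5474

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- n_no / (n_yes + n_no)
round(baseline_accuracy * 100, 2)  # 94.02
```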

2.a.vi) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##   MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR 
##         0         0         0         0         0         0         0 
##    MGODOV    MGODGE    MRELGE    MRELSA    MRELOV  MFALLEEN  MFGEKIND 
##         0         0         0         0         0         0         0 
##  MFWEKIND  MOPLHOOG  MOPLMIDD  MOPLLAAG  MBERHOOG  MBERZELF  MBERBOER 
##         0         0         0         0         0         0         0 
##  MBERMIDD  MBERARBG  MBERARBO      MSKA     MSKB1     MSKB2      MSKC 
##         0         0         0         0         0         0         0 
##      MSKD    MHHUUR    MHKOOP     MAUT1     MAUT2     MAUT0   MZFONDS 
##         0         0         0         0         0         0         0 
##    MZPART   MINKM30  MINK3045  MINK4575  MINK7512  MINK123M   MINKGEM 
##         0         0         0         0         0         0         0 
##  MKOOPKLA   PWAPART   PWABEDR   PWALAND  PPERSAUT   PBESAUT   PMOTSCO 
##         0         0         0         0         0         0         0 
##   PVRAAUT  PAANHANG  PTRACTOR    PWERKT     PBROM    PLEVEN  PPERSONG 
##         0         0         0         0         0         0         0 
##   PGEZONG   PWAOREG    PBRAND   PZEILPL  PPLEZIER    PFIETS   PINBOED 
##         0         0         0         0         0         0         0 
##  PBYSTAND   AWAPART   AWABEDR   AWALAND  APERSAUT   ABESAUT   AMOTSCO 
##         0         0         0         0         0         0         0 
##   AVRAAUT  AAANHANG  ATRACTOR    AWERKT     ABROM    ALEVEN  APERSONG 
##         0         0         0         0         0         0         0 
##   AGEZONG   AWAOREG    ABRAND   AZEILPL  APLEZIER    AFIETS   AINBOED 
##         0         0         0         0         0         0         0 
##  ABYSTAND targetVar 
##         0         0
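
With 86 columns, the per-column printout is long. A compact alternative, sketched here on a made-up toy data frame, is to count the missing values with colSums() and list only the columns that have any:

```r
# Toy data frame with a few missing values (illustrative only)
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# is.na() on a data frame returns a logical matrix; colSums() counts per column
na_counts <- colSums(is.na(toy))

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]
cols_with_na  # "a" "b"
```

For xy_train, where every count is zero, cols_with_na would simply be empty.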

2.b) Data visualizations

2.b.i) Univariate plots to better understand each attribute.

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}

2.b.ii) Multivariate plots to better understand the relationships between attributes

# Scatterplot matrix colored by class
# pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x_train, y=y_train, plot="box", scales=scales)

# Density plots for each attribute by class value
featurePlot(x=x_train, y=y_train, plot="density", scales=scales)

# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")

email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@357246de}"
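
A correlation plot of 85 attributes can be hard to read. As a complement, the sketch below lists the attribute pairs whose absolute correlation exceeds a cutoff. The toy data, the variable names, and the 0.9 cutoff are assumptions chosen for illustration, not values from the original analysis.

```r
set.seed(888)

# Toy data: x2 is nearly a copy of x1, x3 is independent noise
x1 <- rnorm(100)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(100, sd = 0.05), x3 = rnorm(100))

cors <- cor(toy)

# Look only at the upper triangle so each pair is reported once
idx <- which(upper.tri(cors) & abs(cors) > 0.9, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r = cors[idx]
)
high_pairs
```

Applied to the correlations matrix computed above, the same indexing would give a short table of strongly related attributes to review.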

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. Typical data-prep tasks include data cleaning, feature selection, and data transforms, each covered in a subsection below.

email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@23223dd8}"

3.a) Data Cleaning

# According to the data dictionary, columns MOSTYPE and MOSHOOFD should be converted to categorical type
xy_train$MOSTYPE <- as.factor(xy_train$MOSTYPE)
xy_train$MOSHOOFD <- as.factor(xy_train$MOSHOOFD)
xy_test$MOSTYPE <- as.factor(xy_test$MOSTYPE)
xy_test$MOSHOOFD <- as.factor(xy_test$MOSHOOFD)
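
Converting the training and test columns separately gives each factor its own set of levels, which can differ if a category is absent from one of the files. Whether that happens in this dataset has not been checked here. The following is a minimal sketch on made-up toy data showing one way to make the test factor reuse the training levels:

```r
# Toy stand-ins for the training and test frames (illustrative values)
train <- data.frame(MOSTYPE = c(1, 3, 8, 3))
test <- data.frame(MOSTYPE = c(3, 8, 8))  # category 1 never appears here

train$MOSTYPE <- as.factor(train$MOSTYPE)

# Reuse the training levels so both factors share one encoding
test$MOSTYPE <- factor(test$MOSTYPE, levels = levels(train$MOSTYPE))

identical(levels(train$MOSTYPE), levels(test$MOSTYPE))  # TRUE
```

One trade-off of this approach is that a test value never seen in training becomes NA, so it is worth counting missing values again after the conversion.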

3.b) Feature Selection

# Not applicable for this iteration of the project.

3.c) Data Transforms

# Not applicable for this iteration of the project.

3.d) Display the Final Dataset for Model-Building

dim(xy_train)
## [1] 5822   86
sapply(xy_train, class)
##   MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR 
##  "factor" "integer" "integer" "integer"  "factor" "integer" "integer" 
##    MGODOV    MGODGE    MRELGE    MRELSA    MRELOV  MFALLEEN  MFGEKIND 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MFWEKIND  MOPLHOOG  MOPLMIDD  MOPLLAAG  MBERHOOG  MBERZELF  MBERBOER 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MBERMIDD  MBERARBG  MBERARBO      MSKA     MSKB1     MSKB2      MSKC 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##      MSKD    MHHUUR    MHKOOP     MAUT1     MAUT2     MAUT0   MZFONDS 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##    MZPART   MINKM30  MINK3045  MINK4575  MINK7512  MINK123M   MINKGEM 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MKOOPKLA   PWAPART   PWABEDR   PWALAND  PPERSAUT   PBESAUT   PMOTSCO 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   PVRAAUT  PAANHANG  PTRACTOR    PWERKT     PBROM    PLEVEN  PPERSONG 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   PGEZONG   PWAOREG    PBRAND   PZEILPL  PPLEZIER    PFIETS   PINBOED 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  PBYSTAND   AWAPART   AWABEDR   AWALAND  APERSAUT   ABESAUT   AMOTSCO 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   AVRAAUT  AAANHANG  ATRACTOR    AWERKT     ABROM    ALEVEN  APERSONG 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   AGEZONG   AWAOREG    ABRAND   AZEILPL  APLEZIER    AFIETS   AINBOED 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  ABYSTAND targetVar 
## "integer"  "factor"
email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5a01ccaa}"
proc.time()-startTimeScript
##    user  system elapsed 
##  57.782   0.879  68.196

4. Model and Evaluate Algorithms

After the data prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The evaluation tasks here are spot-checking linear, non-linear, and ensemble algorithms and then comparing their baseline results.

For this project, we will evaluate one linear, three non-linear, and three ensemble algorithms:

Linear Algorithm: Logistic Regression

Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine

Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting

The random number seed is reset before each run so that every algorithm is evaluated on the same data splits, which makes the results directly comparable.
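
The effect of resetting the seed can be illustrated with a simplified base R sketch. It assigns rows to folds at random, without the class stratification that caret applies, and checks that the same seed gives the same assignment. The helper `make_folds` and the seed value are hypothetical; the script defines its own `seedNum` earlier.

```r
seedNum <- 888   # hypothetical value; the script defines its own seedNum earlier
n_rows  <- 5822  # number of rows in the training set
k       <- 10    # number of cross-validation folds

make_folds <- function(n, k, seed) {
  set.seed(seed)
  # Shuffle fold labels 1..k across the rows.
  sample(rep(seq_len(k), length.out = n))
}

folds_a <- make_folds(n_rows, k, seedNum)
folds_b <- make_folds(n_rows, k, seedNum)
folds_c <- make_folds(n_rows, k, seedNum + 1)

identical(folds_a, folds_b)  # same seed, same splits
identical(folds_a, folds_c)  # different seed, different splits
table(folds_a)               # fold sizes
```

Because every model sees identical folds, differences in the resampled ROC values reflect the algorithms rather than the luck of the split.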

4.a) Generate models using linear algorithms

# Logistic Regression (Classification)
email_notify(paste("Logistic Regression modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1fbc7afb}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

print(fit.glm)
## Generalized Linear Model 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results:
## 
##   ROC        Sens        Spec     
##   0.7207725  0.01142857  0.9965292
proc.time()-startTimeModule
##    user  system elapsed 
##  18.403   0.197  18.846
email_notify(paste("Logistic Regression modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@c818063}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2c8d66b2}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   cp           ROC        Sens        Spec     
##   0.000862069  0.7137003  0.05731092  0.9873927
##   0.002431477  0.6905597  0.02588235  0.9917786
##   0.003831418  0.6020277  0.01428571  0.9967093
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.000862069.
proc.time()-startTimeModule
##    user  system elapsed 
##   7.197   0.010   7.298
email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2752f6e2}"
# k-Nearest Neighbors (Regression/Classification)
email_notify(paste("k-Nearest Neighbors modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1d251891}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   k  ROC        Sens         Spec     
##   5  0.6077055  0.008571429  0.9936031
##   7  0.6152044  0.000000000  0.9981732
##   9  0.6275954  0.000000000  0.9998175
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
##    user  system elapsed 
## 243.999   0.034 246.657
email_notify(paste("k-Nearest Neighbors modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6e8dacdf}"
# Support Vector Machine (Regression/Classification)
email_notify(paste("Support Vector Machine modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7f63425a}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens         Spec     
##   0.25  0.6385720  0.002857143  0.9990869
##   0.50  0.6389426  0.002857143  0.9990869
##   1.00  0.6394141  0.002857143  0.9989041
## 
## Tuning parameter 'sigma' was held constant at a value of 0.00555605
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.00555605 and C = 1.
proc.time()-startTimeModule
##    user  system elapsed 
## 210.238   1.994 214.576
email_notify(paste("Support Vector Machine modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@59a6e353}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2812cbfa}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results:
## 
##   ROC        Sens        Spec     
##   0.6790771  0.07487395  0.9738721
proc.time()-startTimeModule
##    user  system elapsed 
##  85.109   0.605  86.663
email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6aaa5eb0}"
# Random Forest (Regression/Classification)
email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@380fb434}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens        Spec     
##     2   0.6854831  0.00000000  1.0000000
##    66   0.7145401  0.06058824  0.9795374
##   131   0.7211056  0.06630252  0.9755181
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 131.
proc.time()-startTimeModule
##     user   system  elapsed 
## 1336.521    2.029 1353.086
email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@42d3bd8b}"
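
The `mtry` values of 66 and 131 may look odd next to "85 predictor". With the formula interface, caret expands each factor into dummy columns before fitting, so the model sees more columns than the data frame has. The base R sketch below shows the expansion on a toy data frame; the data and the level counts in the last lines are assumptions chosen to be consistent with the reported 131, not values taken from the script.

```r
toy <- data.frame(
  grp = factor(c("a", "b", "c", "d", "a", "b")),  # 4-level factor
  x   = c(1.2, 0.7, 3.1, 2.2, 0.5, 1.9),
  y   = factor(c("Yes", "No", "No", "Yes", "No", "No"))
)

# Design matrix for y ~ . without the intercept column.
mm <- model.matrix(y ~ ., data = toy)[, -1, drop = FALSE]
colnames(mm)
ncol(mm)  # 3 dummy columns for grp plus x

# Same arithmetic for this data set, under assumed level counts.
n_numeric       <- 83  # 85 predictors minus the two factor columns
levels_mostype  <- 40  # assumption: observed levels of MOSTYPE
levels_moshoofd <- 10  # assumption: observed levels of MOSHOOFD
n_expanded <- n_numeric + (levels_mostype - 1) + (levels_moshoofd - 1)
n_expanded
```

A factor with k levels contributes k - 1 columns, which is why two factor columns can push 85 predictors well past 100.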
# Stochastic Gradient Boosting (Regression/Classification)
email_notify(paste("Stochastic Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4e04a765}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=FALSE)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens         Spec     
##   1                   50      0.7650455  0.000000000  0.9994526
##   1                  100      0.7677287  0.002857143  0.9987223
##   1                  150      0.7681374  0.002857143  0.9985395
##   2                   50      0.7647211  0.002857143  0.9985391
##   2                  100      0.7726834  0.005714286  0.9972601
##   2                  150      0.7735426  0.014285714  0.9967123
##   3                   50      0.7728907  0.005714286  0.9981728
##   3                  100      0.7744829  0.020084034  0.9956161
##   3                  150      0.7668234  0.028739496  0.9936055
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 100,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
##  84.865   0.168  85.933
email_notify(paste("Stochastic Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1e67b872}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.glm, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, CART, kNN, SVM, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## ROC 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.5424654 0.7187908 0.7346121 0.7207725 0.7505980 0.7771116    0
## CART    0.6335597 0.6848133 0.7312430 0.7137003 0.7401341 0.7581856    0
## kNN     0.5788965 0.6083592 0.6166638 0.6275954 0.6564493 0.6819796    0
## SVM     0.5814390 0.6095456 0.6461774 0.6394141 0.6642008 0.6870177    0
## BagCART 0.6374771 0.6631625 0.6735072 0.6790771 0.6888940 0.7432742    0
## RF      0.6798642 0.7136872 0.7252546 0.7211056 0.7376468 0.7489312    0
## GBM     0.7005746 0.7625714 0.7835808 0.7744829 0.7969225 0.8023981    0
## 
## Sens 
##               Min.    1st Qu.     Median        Mean    3rd Qu.       Max.
## LR      0.00000000 0.00000000 0.00000000 0.011428571 0.02857143 0.02857143
## CART    0.00000000 0.05714286 0.05714286 0.057310924 0.07899160 0.08571429
## kNN     0.00000000 0.00000000 0.00000000 0.000000000 0.00000000 0.00000000
## SVM     0.00000000 0.00000000 0.00000000 0.002857143 0.00000000 0.02857143
## BagCART 0.02857143 0.05714286 0.08571429 0.074873950 0.08760504 0.11764706
## RF      0.02857143 0.05714286 0.05714286 0.066302521 0.08571429 0.11764706
## GBM     0.00000000 0.00000000 0.00000000 0.020084034 0.02920168 0.08571429
##         NA's
## LR         0
## CART       0
## kNN        0
## SVM        0
## BagCART    0
## RF         0
## GBM        0
## 
## Spec 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.9926874 0.9949726 0.9963504 0.9965292 0.9981718 1.0000000    0
## CART    0.9744059 0.9840037 0.9872146 0.9873927 0.9890511 1.0000000    0
## kNN     0.9981752 1.0000000 1.0000000 0.9998175 1.0000000 1.0000000    0
## SVM     0.9963437 0.9981727 1.0000000 0.9989041 1.0000000 1.0000000    0
## BagCART 0.9634369 0.9693784 0.9744059 0.9738721 0.9781022 0.9817518    0
## RF      0.9670932 0.9712066 0.9771698 0.9755181 0.9781022 0.9835766    0
## GBM     0.9908592 0.9931544 0.9963437 0.9956161 0.9977165 1.0000000    0
dotplot(results)

cat('The average ROC from all models is:', mean(c(results$values$`LR~ROC`, results$values$`CART~ROC`, results$values$`kNN~ROC`, results$values$`SVM~ROC`, results$values$`BagCART~ROC`, results$values$`RF~ROC`, results$values$`GBM~ROC`)))
## The average ROC from all models is: 0.6965926
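
As a cross-check, the same average can be recovered from the mean ROC column of the summary table, and the models can be ranked at the same time. This is a small sketch that copies the seven means from the output above; the vector name `mean_roc` is illustrative.

```r
# Mean cross-validated ROC per model, copied from the summary table.
mean_roc <- c(LR = 0.7207725, CART = 0.7137003, kNN = 0.6275954,
              SVM = 0.6394141, BagCART = 0.6790771, RF = 0.7211056,
              GBM = 0.7744829)

# Models ordered from highest to lowest mean ROC.
ranked <- sort(mean_roc, decreasing = TRUE)
print(ranked)

# Every model has 10 resamples, so the mean of the means equals the pooled mean.
overall <- mean(mean_roc)
cat("Average ROC across models:", round(overall, 7), "\n")
```

The shortcut works only because each model contributes the same number of resamples; with unequal counts the pooled mean would need weights.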

5. Improve Accuracy or Results

Once we have a short list of machine learning algorithms with a good level of performance, we can look for ways to improve the models further.

Using two algorithms from the previous section, Decision Tree (CART) and Random Forest, we will search for the combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

We tune each selected algorithm over a grid of candidate parameter values and see whether we can improve on its baseline ROC.

# Tuning algorithm #1 - Decision Tree (CART)
email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@77b52d12}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(cp = c(0.0001, 0.0005, 0.001, 0.005, 0.01))
fit.final1 <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## CART 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   cp     ROC        Sens         Spec     
##   1e-04  0.7034841  0.065966387  0.9853831
##   5e-04  0.7051201  0.065966387  0.9857487
##   1e-03  0.7137003  0.057310924  0.9873927
##   5e-03  0.5205876  0.002857143  0.9994516
##   1e-02  0.5000000  0.000000000  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
proc.time()-startTimeModule
##    user  system elapsed 
##   5.703   0.035   5.802
email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4157f54e}"
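
The selection rule that caret reports ("ROC was used to select the optimal model using the largest value") can be sketched in base R. The ROC values are copied from the printed results; attaching them to the grid by hand is only for illustration, since `train` does this internally.

```r
# Candidate complexity parameters, as in the CART tuning grid.
grid <- expand.grid(cp = c(0.0001, 0.0005, 0.001, 0.005, 0.01))

# Mean cross-validated ROC per candidate, copied from the printed results.
grid$ROC <- c(0.7034841, 0.7051201, 0.7137003, 0.5205876, 0.5000000)

# Pick the row with the largest ROC.
best <- grid[which.max(grid$ROC), ]
print(best)
```

The ROC of exactly 0.5 at `cp = 0.01`, together with a sensitivity of 0 and a specificity of 1, is consistent with a tree that has been pruned back to a single node.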
# Tuning algorithm #2 - Random Forest (RF)
email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@6615435c}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(5, 10, 20, 35, 50))
fit.final2 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final2)

print(fit.final2)
## Random Forest 
## 
## 5822 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 5240, 5240, 5241, 5240, 5240, 5240, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens         Spec     
##    5    0.7027004  0.005798319  0.9978076
##   10    0.7100881  0.028823529  0.9908636
##   20    0.7154303  0.037563025  0.9870271
##   35    0.7153575  0.057731092  0.9828244
##   50    0.7159008  0.057731092  0.9809985
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 50.
proc.time()-startTimeModule
##     user   system  elapsed 
## 1620.398    2.866 1641.105
email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7225790e}"

5.b) Compare Algorithms After Tuning

results <- resamples(list(CART=fit.final1, RF=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CART, RF 
## Number of resamples: 10 
## 
## ROC 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART 0.6335597 0.6848133 0.7312430 0.7137003 0.7401341 0.7581856    0
## RF   0.6791329 0.6999891 0.7189867 0.7159008 0.7301532 0.7521116    0
## 
## Sens 
##      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
## CART    0 0.05714286 0.05714286 0.05731092 0.07899160 0.08571429    0
## RF      0 0.03571429 0.05714286 0.05773109 0.07857143 0.11764706    0
## 
## Spec 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART 0.9744059 0.9840037 0.9872146 0.9873927 0.9890511 1.0000000    0
## RF   0.9689214 0.9789762 0.9817351 0.9809985 0.9849453 0.9872263    0
dotplot(results)

6. Finalize Model and Present Results

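Once we have narrowed the candidates down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. For this project, finalizing the model involves making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.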

email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@52af6cff}"

6.a) Predictions on validation dataset

predictions <- predict(fit.final2, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes   17   53
##        No   221 3709
##                                           
##                Accuracy : 0.9315          
##                  95% CI : (0.9232, 0.9391)
##     No Information Rate : 0.9405          
##     P-Value [Acc > NIR] : 0.9917          
##                                           
##                   Kappa : 0.0857          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.07143         
##             Specificity : 0.98591         
##          Pos Pred Value : 0.24286         
##          Neg Pred Value : 0.94377         
##              Prevalence : 0.05950         
##          Detection Rate : 0.00425         
##    Detection Prevalence : 0.01750         
##       Balanced Accuracy : 0.52867         
##                                           
##        'Positive' Class : Yes             
## 
# ROCR is given the hard class labels here, so the curve has a single operating point
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)

auc <- performance(pred, measure = "auc")
cat('The area under the curve (AUC) value is:', auc@y.values[[1]])
## The area under the curve (AUC) value is: 0.5286702
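
The AUC above is computed from hard class labels rather than class probabilities. In that situation the ROC curve has a single operating point, and the area under it reduces to the balanced accuracy, (sensitivity + specificity) / 2. The following base R sketch reproduces the reported figures from the four confusion-matrix counts alone; the variable names are illustrative and not part of the original script.

```r
# Counts from the validation confusion matrix (rows = prediction, cols = reference).
tp <- 17; fp <- 53; fn <- 221; tn <- 3709
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement versus agreement expected by chance.
p_chance  <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

# Rank-based (Mann-Whitney) AUC on the hard 0/1 predictions.
truth <- c(rep(1, tp), rep(0, fp), rep(1, fn), rep(0, tn))
score <- c(rep(1, tp + fp), rep(0, fn + tn))
r     <- rank(score)                      # ties receive average ranks
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc   <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa_hat,
        balanced = balanced, auc = auc), 4)
```

This suggests that the validation figure is not directly comparable with the cross-validated ROC values, which caret computes from class probabilities. One possible next step, not run here, would be to request probabilities with predict(fit.final2, newdata=xy_test, type="prob") and score the "Yes" column instead of the predicted labels.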

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(seedNum)

# Optional: combine the training and test datasets before training the final model (disabled for this iteration)
# xy_train <- rbind(xy_train, xy_test)

finalModel <- randomForest(targetVar~., data=xy_train, mtry=50)
print(finalModel)
## 
## Call:
##  randomForest(formula = targetVar ~ ., data = xy_train, mtry = 50) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 50
## 
##         OOB estimate of  error rate: 7.71%
## Confusion matrix:
##     Yes   No class.error
## Yes  24  324  0.93103448
## No  125 5349  0.02283522
proc.time()-startTimeModule
##    user  system elapsed 
##  33.522   0.113  34.004
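
The out-of-bag (OOB) error rate and the per-class errors printed by `randomForest` follow directly from its confusion matrix. The short sketch below recomputes them from the four counts; the matrix is typed in by hand from the output above rather than taken from `finalModel`.

```r
# OOB confusion matrix printed by randomForest (rows = actual, cols = predicted).
oob <- matrix(c(24, 324,
                125, 5349),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("Yes", "No"),
                              predicted = c("Yes", "No")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal cases over all cases.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
cat(sprintf("OOB error rate: %.2f%%\n", 100 * oob_error))
```

The low overall error mostly reflects the majority "No" class; the error on the "Yes" class, the one the business question cares about, is above 90%.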

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3dd3bcd}"
proc.time()-startTimeScript
##     user   system  elapsed 
## 3710.458    9.149 3801.231
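
The `saveRDS` call is commented out in this iteration. For reference, a save-and-restore round trip could look like the sketch below. It uses a small `glm` on the built-in `mtcars` data as a stand-in for `finalModel` and a temporary file instead of a project path; both are assumptions made so that the example runs on its own.

```r
# Stand-in for finalModel: a small logistic regression on a built-in data set.
demo_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Write to a temporary file here; a real script would use a project path.
model_path <- tempfile(fileext = ".rds")
saveRDS(demo_model, model_path)

# Later, or in another session: restore the object and predict with it.
restored <- readRDS(model_path)
new_cars <- data.frame(wt = c(2.5, 3.5))
p_before <- predict(demo_model, newdata = new_cars, type = "response")
p_after  <- predict(restored,   newdata = new_cars, type = "response")

all.equal(p_before, p_after)
```

Unlike `save()`, `saveRDS()` stores a single object without its name, so the restored model can be assigned to any variable.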